Skip to content

[TRTLLM-13543][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1)#15525

Open
chienchunhung wants to merge 2 commits into
NVIDIA:mainfrom
chienchunhung:WideEP-FT/1b.1-reconfigure-masking
Open

[TRTLLM-13543][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1)#15525
chienchunhung wants to merge 2 commits into
NVIDIA:mainfrom
chienchunhung:WideEP-FT/1b.1-reconfigure-masking

Conversation

@chienchunhung

@chienchunhung chienchunhung commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add mask-only EPLB reconfiguration in C++, nanobind, and Python.
  • Make dead-rank slots unreachable and preserve the mask across later dynamic EPLB updates.
  • Fail closed if masking would leave any expert without a live replica.
  • Reject direct reconfiguration while an iteration is active.

Scope

  • Standalone on main; no in-flight PR dependency.
  • HBM/capacity admission remains an integration-layer gate before calling this API.

Validation

  • git diff --check upstream/main..HEAD
  • python -m py_compile tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py tests/unittest/_torch/modules/test_moe_load_balancer.py

Summary by CodeRabbit

  • New Features

    • Introduced capability to reconfigure MOE load balancing when compute ranks become unavailable, with automatic validation to ensure all experts retain at least one active replica.
  • Tests

    • Added tests for mask-only reconfiguration scenarios including replica validation, active iteration protection, and edge case handling.

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55089 [ run ] triggered by Bot. Commit: 7651902 Link to invocation

@chienchunhung chienchunhung changed the title [TRTLLM-12557][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1) [TRTLLM-13543][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1) Jun 22, 2026
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55089 [ run ] completed with state FAILURE. Commit: 7651902
/LLM/main/L0_MergeRequest_PR pipeline #44075 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung requested a review from dongxuy04 June 23, 2026 02:46
@chienchunhung chienchunhung marked this pull request as ready for review June 23, 2026 05:08
@chienchunhung chienchunhung requested a review from a team as a code owner June 23, 2026 05:08
@chienchunhung chienchunhung requested a review from yuxianq June 23, 2026 05:08
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds reconfigureMaskOnly(deadRanks) to MoeLoadBalancer to reconfigure expert placement by masking dead ranks without full rebalancing. The C++ placement core (doReplication, doPlacement) gains an optional dead-rank mask parameter; iteration lifecycle tracking (mIterActive) is introduced; and matching Python bindings and unit tests are added.

Changes

MoeLoadBalancer Mask-Only Reconfiguration

Layer / File(s) Summary
Public API contracts and new member declarations
cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.h
Adds reconfigureMaskOnly, three SingleLayerMoeLoadBalancer helpers, getDeadRankMaskSnapshot, mIterActive boolean, mDeadRankMask/mDeadRankMaskMutex members, and updates doReplication/doPlacement free-function signatures to accept an optional deadRankMask pointer.
Dead-rank mask support in doReplication and doPlacement
cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp
Introduces isRankMasked/getActiveSlotCount utilities. doReplication computes totalSlotCount from active ranks and checks a replica floor. doPlacement pre-fills all rankExpertIds to -1, seeds the priority queue only with unmasked ranks, and reworks GPU slot construction to skip invalid slots and assert assigned counts.
SingleLayerMoeLoadBalancer and MoeLoadBalancer::reconfigureMaskOnly implementation
cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp
Per-layer helpers compute surviving replica counts, validate every expert retains at least one replica, and trigger CPU/GPU placement. Top-level reconfigureMaskOnly locks against concurrent iterations, builds the candidate mask, waits for layer updates, validates all layers, stores the mask, and reconfigures each layer. startIter/endIter now maintain mIterActive; weight-update routines snapshot and forward the mask.
nanobind and Python wrapper
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp, tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
nanobind exposes reconfigure_mask_only(dead_ranks) with GIL release. The Python MoeLoadBalancer.reconfigure_mask_only method guards against active iteration via self.in_iter and delegates to load_balancer_impl.
C++ and Python unit tests
cpp/tests/unit_tests/runtime/moeLoadBalancerTest.cpp, tests/unittest/_torch/modules/test_moe_load_balancer.py
Three C++ tests cover slot invalidation for dead ranks, rejection during active iteration, and rejection when masking the last replica. Two Python tests cover lifecycle forwarding to the mock and RuntimeError on active iteration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: adding mask-only reconfiguration to Expert Pipeline Load Balancer (EPLB) for WideEP FT, which aligns with the PR's primary functionality.
Description check ✅ Passed The PR description provides a clear summary of changes, scope, and validation steps. However, it lacks explicit test coverage documentation as required by the template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp (1)

153-175: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Make the mask-dependent replica-count invariant a runtime check.

Line 159 makes totalSlotCount depend on the active mask, but Line 175 only checks the replica-count sum with assert. In release builds, a stale or mismatched expertReplicaCount can drain pq and make pq.top() undefined behavior. Use TLLM_CHECK_WITH_INFO here.

Proposed fix
-    assert(static_cast<int>(allReplicas.size()) == totalSlotCount);
+    TLLM_CHECK_WITH_INFO(static_cast<int>(allReplicas.size()) == totalSlotCount,
+        "Replica count sum (%ld) must match active slot count (%d)", static_cast<long>(allReplicas.size()),
+        totalSlotCount);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp` around lines
153 - 175, The assertion check at the end of the function that compares
allReplicas.size() to totalSlotCount is only enforced in debug builds. Since
totalSlotCount depends on the active mask and can mismatch with
expertReplicaCount, this invariant violation could cause undefined behavior in
release builds. Replace the assert statement with TLLM_CHECK_WITH_INFO to ensure
this critical runtime check is performed in all build configurations, and
provide a descriptive error message that explains the mismatch between the
expected slot count and actual replica count.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)

992-992: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use 3.10-native typing and an explicit return type in the new API signature.

Align this new method with repo typing rules by switching to list[int] and annotating -> None.

♻️ Proposed fix
-    def reconfigure_mask_only(self, dead_ranks: List[int]):
+    def reconfigure_mask_only(self, dead_ranks: list[int]) -> None:

As per coding guidelines, "Always annotate functions with return types" and "Prefer using the built-in types list, dict, and tuple to the legacy typing.List, typing.Dict, and typing.Tuple."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py` at line 992, The
reconfigure_mask_only method signature does not conform to the repository's
typing standards. Update the method signature by replacing the legacy List[int]
type hint with the Python 3.10-native list[int], and add an explicit return type
annotation of -> None to the method definition to align with the coding
guidelines that require all functions to have annotated return types and prefer
built-in generic types over typing module imports.

Source: Coding guidelines

tests/unittest/_torch/modules/test_moe_load_balancer.py (1)

272-304: 📐 Maintainability & Code Quality | 🔵 Trivial

Coverage assessment: sufficient for this PR, with one optional follow-up outside it.

Coverage is sufficient here: tests/unittest/_torch/modules/test_moe_load_balancer.py validates Python forwarding and active-iteration rejection, and cpp/tests/unit_tests/runtime/moeLoadBalancerTest.cpp validates dead-rank masking plus fail-closed behavior. Optional follow-up outside this PR: add one non-mocked Python integration case in tests/unittest/_torch/modules/test_moe_load_balancer.py to verify backend fail-closed exceptions surface end-to-end.

As per path instructions, "tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/test_moe_load_balancer.py` around lines 272 -
304, The current test coverage in test_moe_load_balancer.py is sufficient for
this PR, but there is an optional follow-up suggestion to add one non-mocked
Python integration test case to the test_moe_load_balancer.py file that verifies
backend fail-closed exceptions surface end-to-end. This test should exercise the
MoeLoadBalancer with actual backend behavior (not mocked) to ensure that
exceptions from fail-closed scenarios are properly propagated and not swallowed
or incorrectly handled during the MoeLoadBalancer lifecycle operations.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp`:
- Around line 1127-1134: The mDeadRankMask is being updated before all layers
are reconfigured with the candidateDeadRankMask, which can expose an
inconsistent state if layer->reconfigureMaskOnly() throws an exception. Move the
entire lock_guard block that updates mDeadRankMask to after the for loop
iterating over mLayers completes successfully. This ensures the shared mask is
only published once all per-layer reconfiguration has succeeded, preventing
other threads from observing a partially-applied mask state.
- Around line 60-78: The `getActiveSlotCount` function does not validate that
the `deadRankMask` vector has the correct size before using it. Currently, if
the mask is shorter than `metaInfo.epSize`, ranks beyond the vector size are
treated as active in the `isRankMasked` check, which can leave dead-rank slots
reachable. Add validation at the beginning of `getActiveSlotCount` to ensure
that if `deadRankMask` is provided (not nullptr), its size must equal
`metaInfo.epSize`. If the size does not match, handle the error appropriately
(such as asserting or returning an error indicator).

---

Outside diff comments:
In `@cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp`:
- Around line 153-175: The assertion check at the end of the function that
compares allReplicas.size() to totalSlotCount is only enforced in debug builds.
Since totalSlotCount depends on the active mask and can mismatch with
expertReplicaCount, this invariant violation could cause undefined behavior in
release builds. Replace the assert statement with TLLM_CHECK_WITH_INFO to ensure
this critical runtime check is performed in all build configurations, and
provide a descriptive error message that explains the mismatch between the
expected slot count and actual replica count.

---

Nitpick comments:
In `@tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py`:
- Line 992: The reconfigure_mask_only method signature does not conform to the
repository's typing standards. Update the method signature by replacing the
legacy List[int] type hint with the Python 3.10-native list[int], and add an
explicit return type annotation of -> None to the method definition to align
with the coding guidelines that require all functions to have annotated return
types and prefer built-in generic types over typing module imports.

In `@tests/unittest/_torch/modules/test_moe_load_balancer.py`:
- Around line 272-304: The current test coverage in test_moe_load_balancer.py is
sufficient for this PR, but there is an optional follow-up suggestion to add one
non-mocked Python integration test case to the test_moe_load_balancer.py file
that verifies backend fail-closed exceptions surface end-to-end. This test
should exercise the MoeLoadBalancer with actual backend behavior (not mocked) to
ensure that exceptions from fail-closed scenarios are properly propagated and
not swallowed or incorrectly handled during the MoeLoadBalancer lifecycle
operations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8ea43521-47c2-40dc-97de-03fb29f90a56

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed7ce4 and 7651902.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp
  • cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.h
  • cpp/tests/unit_tests/runtime/moeLoadBalancerTest.cpp
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tests/unittest/_torch/modules/test_moe_load_balancer.py

Comment thread cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp
Comment thread cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp Outdated
@chienchunhung chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch from 7651902 to 801713d Compare June 23, 2026 17:04
@chienchunhung

Copy link
Copy Markdown
Collaborator Author

/bot run

@chienchunhung chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch from 801713d to 088dd5c Compare June 23, 2026 17:06
@chienchunhung

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55289 [ run ] triggered by Bot. Commit: 088dd5c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55290 [ run ] triggered by Bot. Commit: 088dd5c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55289 [ run ] completed with state ABORTED. Commit: 088dd5c

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55290 [ run ] completed with state FAILURE. Commit: 088dd5c
/LLM/main/L0_MergeRequest_PR pipeline #44241 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --stage-list "L40S-PyTorch-1,H100_PCIe-CPP-1"

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55343 [ run ] triggered by Bot. Commit: 088dd5c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55343 [ run ] completed with state SUCCESS. Commit: 088dd5c
/LLM/main/L0_MergeRequest_PR pipeline #44292 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch from 088dd5c to 5f391a6 Compare June 24, 2026 20:53

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55581 [ run ] triggered by Bot. Commit: 5f391a6 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55581 [ run ] completed with state FAILURE. Commit: 5f391a6
/LLM/main/L0_MergeRequest_PR pipeline #44499 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55614 [ run ] triggered by Bot. Commit: 5f391a6 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55614 [ run ] completed with state FAILURE. Commit: 5f391a6
/LLM/main/L0_MergeRequest_PR pipeline #44532 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants